
(NIPS 2017) Dynamic Routing Between Capsules

Keywords: [Classification] [Image Reconstruction] [CapsNet]

Sabour S, Frosst N, Hinton G E. Dynamic routing between capsules[C]//Advances in neural information processing systems. 2017: 3856-3866.


1. Overview


Based on the cortical functional column structure (columns and mini-columns) discovered by V.B. Mountcastle in cognitive neuroscience, the paper proposes the concept of a capsule.

1.1. Capsule Definition

  • A capsule is a group of neurons whose activity vector represents the instantiation parameters of a specific type of entity (an object or an object part).
  • The length of the activity vector represents the probability that the entity exists, and its orientation represents the instantiation parameters.
  • Instantiation parameters include pose (position, size, orientation), deformation, velocity, albedo, hue, texture, etc.

1.2. Capsule Mechanism

An active $capsule_i$ at level L predicts, via transformation matrices, the instantiation parameters of $capsule_j$ at level L+1.

When multiple predictions agree, a $capsule_j$ at level L+1 becomes active. (That is, a higher-level $capsule_j$ represents a combination of lower-level $capsule_i$s; for example, several low-level features of the digit 2 are represented by different $capsule_i$s, and when these features are all present, the higher-level $capsule_j$ representing their combination is activated.)

CapsNet uses an iterative routing-by-agreement mechanism: given the activity vector $v_j$ of a $capsule_j$ at level L+1 and the prediction $\hat{u}_{j|i}$ made for it by a $capsule_i$ at level L, the larger their scalar product $\hat{u}_{j|i} \cdot v_j$, the more $capsule_i$ tends to send its output to that $capsule_j$.

On MNIST, CapsNet outperforms a comparable CNN, reaching a test accuracy of about 99.75%.


2. Introduction


2.1. Assumptions

  • Assumption: the multi-layer human visual system builds something like a parse tree on each fixation.
  • Assumption: for a single fixation, the parse tree can be carved out of a fixed multilayer neural network (akin to pruning).

Each layer is divided into groups of neurons (capsules), and each node in the parse tree corresponds to an active capsule (i.e., each node of the parse tree stands for a group of neurons in one layer of the network, and that group is called a capsule).

Through an iterative routing process, each active capsule chooses a capsule in the layer above as its parent node. For the higher levels of the visual system, this iterative routing solves the problem of assigning parts of an object to the whole.

2.2. Representing Existence

  1. A logistic unit.
  2. The length of the vector of instantiation parameters (the paper's choice, implemented via a squashing function).

2.3. Dynamic Routing

  • Because a capsule outputs a vector, dynamic routing can be used to make sure a capsule's output is sent to an appropriate parent node.
  • For a capsule A at level L and a capsule B at level L+1: A first computes a prediction vector (A's output multiplied by a weight matrix), then computes the scalar product of this prediction vector with B's output, and that scalar product is fed back into the coupling coefficient between A and B.
  • The larger the scalar product, the larger the feedback to the coupling coefficient between A and B, i.e., the coupling coefficient between A and B increases while A's coupling coefficients with other parents decrease (via a softmax function); see the toy sketch below.
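A toy numpy sketch of this feedback loop; the agreement values are invented purely for illustration, and `softmax` here simply turns the logits into coupling coefficients:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Toy example: capsule A at level L has 3 candidate parents at level L+1.
b = np.zeros(3)                        # routing logits, initialized to 0
print(softmax(b))                      # [0.333 0.333 0.333] -- equal coupling at first

# Suppose A's prediction vector agrees strongly with parent 1's output and
# weakly with the others (agreement = scalar product of the two vectors).
agreement = np.array([0.1, 2.0, -0.5])
b += agreement                         # feedback: add the agreement to the logits
print(softmax(b))                      # coupling to parent 1 grows, the others shrink
```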

2.4. Segmenting Overlapping Digits

  • The routing implemented by max pooling ignores all but the most active feature detector in a local pool; routing by agreement does not discard this information and is therefore able to segment highly overlapping objects.
  • Max pooling also throws away information about the precise position of the entity within the region.

2.5. Improvements over CNNs

  • Replace the scalar-output feature detectors of CNNs with vector-output capsules.
  • Replace max pooling with routing by agreement.

2.6. Coding

  • For low-level capsules, location information is place-coded by which capsule is active.
  • As we ascend the hierarchy, more and more of the positional information is rate-coded in the real-valued components of a capsule's output vector.
  • The shift from place coding to rate coding suggests that higher-level capsules represent more complex entities with more degrees of freedom, i.e., the dimensionality of capsules should increase as we ascend the hierarchy.

3. Capsule Computation


3.1. Squashing function

(Here $capsule_i$ is at level L and $capsule_j$ is at level L+1.)

  • The squashing non-linearity shrinks short vectors to almost zero length and long vectors to a length slightly below 1.

$$ v_j = \frac{\|s_j\|^2}{1 + \|s_j\|^2} \, \frac{s_j}{\|s_j\|} $$

$v_j$: the output vector of $capsule_j$.
$s_j$: the total input vector of $capsule_j$, i.e., the weighted sum of the prediction vectors from every $capsule_i$ at level L to this particular $capsule_j$ at level L+1.
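A minimal numpy sketch of the squashing function above; the small `eps` is my addition for numerical stability and is not part of the paper's formula:

```python
import numpy as np

def squash(s, eps=1e-8):
    """Scale s so its length lies in [0, 1) while keeping its direction."""
    norm_sq = np.sum(s ** 2, axis=-1, keepdims=True)
    norm = np.sqrt(norm_sq + eps)
    return (norm_sq / (1.0 + norm_sq)) * (s / norm)

s = np.array([[0.1, 0.0], [10.0, 0.0]])
print(np.linalg.norm(squash(s), axis=-1))   # short vector -> ~0.01, long vector -> ~0.99
```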

3.2. Total input & prediction vectors



$$ s_j = \sum_i c_{ij} \, \hat{u}_{j|i}, \qquad \hat{u}_{j|i} = W_{ij} \, u_i $$

$u_i$: the output vector of $capsule_i$.
$W_{ij}$: the weight matrix between $capsule_i$ and $capsule_j$.
$\hat{u}_{j|i}$: the prediction vector from a $capsule_i$ to a $capsule_j$.
$c_{ij}$: the coupling coefficient between a $capsule_i$ and a $capsule_j$.
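A numpy sketch of these two equations, using the layer sizes that appear later in the CapsNet architecture (1152 capsules of 8 dimensions feeding 10 capsules of 16 dimensions); the uniform coupling coefficients are just a placeholder before routing:

```python
import numpy as np

num_in, dim_in = 1152, 8       # capsules i at level L
num_out, dim_out = 10, 16      # capsules j at level L+1

u = np.random.randn(num_in, dim_in)                     # outputs u_i
W = np.random.randn(num_in, num_out, dim_out, dim_in)   # weight matrices W_ij

# Prediction vectors: u_hat[i, j] = W_ij @ u_i
u_hat = np.einsum('ijab,ib->ija', W, u)                 # (1152, 10, 16)

# Total input to capsule j: s_j = sum_i c_ij * u_hat[i, j]
c = np.full((num_in, num_out), 1.0 / num_out)           # uniform couplings (placeholder)
s = np.einsum('ij,ija->ja', c, u_hat)                   # (10, 16)
```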

3.3. Coupling coefficients



$$ c_{ij} = \frac{\exp(b_{ij})}{\sum_k \exp(b_{ik})} $$

$b_{ij}$: the initial logit, i.e., the log prior probability that $capsule_i$ should be coupled to $capsule_j$. It depends on the locations and types of the two capsules (i and j), not on the current input image.

$c_{ij}$ is then iteratively refined by measuring the agreement between $v_j$ and $\hat{u}_{j|i}$.

3.4. Agreement

The agreement is the scalar product $a_{ij} = v_j \cdot \hat{u}_{j|i}$. It is treated as if it were a log likelihood and is added to the initial logit $b_{ij}$ before computing the new values of the coupling coefficients.

3.5. Algorithm

The routing procedure (Procedure 1 in the paper): initialize all logits $b_{ij} \leftarrow 0$; then, for $r$ iterations, compute $c_i \leftarrow \mathrm{softmax}(b_i)$ for every $capsule_i$ in layer L, compute $s_j \leftarrow \sum_i c_{ij}\hat{u}_{j|i}$ and $v_j \leftarrow \mathrm{squash}(s_j)$ for every $capsule_j$ in layer L+1, and update $b_{ij} \leftarrow b_{ij} + \hat{u}_{j|i} \cdot v_j$. This is sketched in code below.
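A numpy sketch of this procedure (`u_hat` has the same layout as in the sketch in 3.2; `r = 3` is the number of routing iterations used in the paper's experiments):

```python
import numpy as np

def squash(s, eps=1e-8):
    norm_sq = np.sum(s ** 2, axis=-1, keepdims=True)
    return (norm_sq / (1.0 + norm_sq)) * s / np.sqrt(norm_sq + eps)

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def routing(u_hat, r=3):
    """u_hat: (num_in, num_out, dim_out) prediction vectors. Returns v: (num_out, dim_out)."""
    num_in, num_out, _ = u_hat.shape
    b = np.zeros((num_in, num_out))                  # b_ij <- 0
    for _ in range(r):
        c = softmax(b, axis=1)                       # c_i <- softmax(b_i)
        s = np.einsum('ij,ija->ja', c, u_hat)        # s_j <- sum_i c_ij * u_hat_{j|i}
        v = squash(s)                                # v_j <- squash(s_j)
        b = b + np.einsum('ija,ja->ij', u_hat, v)    # b_ij <- b_ij + u_hat_{j|i} . v_j
    return v

v = routing(np.random.randn(1152, 10, 16))
print(v.shape)                                       # (10, 16)
```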




4. Margin Loss


An SVM-style margin loss is applied to each digit capsule $c$:

$$ L_c = T_c \, \max(0,\; m^{+} - \|v_c\|)^2 + \lambda \, (1 - T_c) \, \max(0,\; \|v_c\| - m^{-})^2 $$

  • $m^{+}=0.9$, $m^{-}=0.1$.
  • $T_c=1$ iff a digit of class $c$ is present.
  • The paper suggests $\lambda=0.5$ to down-weight the loss for absent digit classes.

The total loss is the sum of the losses of all digit capsules; a code sketch follows.
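A numpy sketch of the margin loss with the paper's constants (function and variable names are mine):

```python
import numpy as np

def margin_loss(v_lengths, targets, m_pos=0.9, m_neg=0.1, lam=0.5):
    """v_lengths: (batch, 10) capsule lengths ||v_c||; targets: (batch, 10) one-hot T_c."""
    present = targets * np.maximum(0.0, m_pos - v_lengths) ** 2
    absent = lam * (1.0 - targets) * np.maximum(0.0, v_lengths - m_neg) ** 2
    return (present + absent).sum(axis=1).mean()    # sum over classes, average over batch
```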



5. CapsNet Architecture (Figure 1)




5.1. Conv1

256 kernels of size 9x9, stride 1, ReLU activation.

  • Input shape: (batch_size, 1, 28, 28)
  • Output shape: (batch_size, 256, 20, 20)

5.2. PrimaryCapsules

  • Input shape: (batch_size, 256, 20, 20)
  • Output shape: (batch_size, 32, 6, 6, 8)

Activating the primary capsules can be seen as inverting the rendering process that generated the image.

  • PrimaryCapsules is a convolutional capsule layer with 32 channels of convolutional 8D capsules; i.e., each primary capsule contains 8 convolutional units with 9x9 kernels and stride 2.
  • PrimaryCapsules has (32, 6, 6) capsule outputs in total, each an 8-dimensional vector. Capsules within the same 6x6 grid share weights, just as a CNN's convolution kernels do; viewed per channel, each capsule type outputs a grid of vectors rather than a single output vector. A layer sketch follows.
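A hedged PyTorch sketch of Conv1 and PrimaryCapsules. Implementing the 32 channels of 8D capsules as a single 256-channel convolution that is then reshaped is a common reading, not something the paper spells out; the kernel sizes, strides, and shapes follow the text above:

```python
import torch
import torch.nn as nn

def squash(s, dim=-1, eps=1e-8):
    norm_sq = (s ** 2).sum(dim=dim, keepdim=True)
    return (norm_sq / (1.0 + norm_sq)) * s / torch.sqrt(norm_sq + eps)

class PrimaryCapsules(nn.Module):
    """32 channels of 8D convolutional capsules: one 9x9, stride-2 convolution with
    32 * 8 = 256 output channels, reshaped into 8D capsule vectors."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(256, 32 * 8, kernel_size=9, stride=2)

    def forward(self, x):                      # x: (batch, 256, 20, 20)
        out = self.conv(x)                     # (batch, 256, 6, 6)
        out = out.view(x.size(0), 32, 8, 6, 6)
        out = out.permute(0, 1, 3, 4, 2)       # (batch, 32, 6, 6, 8)
        return squash(out, dim=-1)

conv1 = nn.Sequential(nn.Conv2d(1, 256, kernel_size=9, stride=1), nn.ReLU())
x = torch.randn(2, 1, 28, 28)
print(PrimaryCapsules()(conv1(x)).shape)       # torch.Size([2, 32, 6, 6, 8])
```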

5.3. DigitCaps

  • Input shape: (batch_size, 32x6x6 = 1152, 8)
  • Output shape: (batch_size, 10, 16)

  • 10 capsules of 16 dimensions each, one per digit class.

  • In CapsNet, routing is performed only between PrimaryCapsules and DigitCaps. Because the output of Conv1 is 1D, there is no orientation to agree on, so no routing is used between Conv1 and PrimaryCapsules.

The Adam optimizer is used, and the routing logits $b_{ij}$ are initialized to 0. A layer sketch follows.
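A hedged PyTorch sketch of DigitCaps with dynamic routing. The class and variable names and the 0.01 weight-initialization scale are my assumptions; the shapes, the 3 routing iterations, and the zero-initialized logits follow the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def squash(s, dim=-1, eps=1e-8):
    norm_sq = (s ** 2).sum(dim=dim, keepdim=True)
    return (norm_sq / (1.0 + norm_sq)) * s / torch.sqrt(norm_sq + eps)

class DigitCaps(nn.Module):
    def __init__(self, num_in=32 * 6 * 6, dim_in=8, num_out=10, dim_out=16, iters=3):
        super().__init__()
        self.iters = iters
        self.W = nn.Parameter(0.01 * torch.randn(num_in, num_out, dim_out, dim_in))

    def forward(self, u):                                  # u: (batch, 1152, 8)
        u_hat = torch.einsum('ijab,nib->nija', self.W, u)  # predictions (batch, 1152, 10, 16)
        b = torch.zeros(u.size(0), u.size(1), self.W.size(1), device=u.device)
        for _ in range(self.iters):
            c = F.softmax(b, dim=2)                        # couplings over the 10 parents
            s = torch.einsum('nij,nija->nja', c, u_hat)    # total input s_j
            v = squash(s)                                  # output v_j
            b = b + torch.einsum('nija,nja->nij', u_hat, v)
        return v                                           # (batch, 10, 16)

print(DigitCaps()(torch.randn(2, 1152, 8)).shape)          # torch.Size([2, 10, 16])
```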

5.4. Reconstruction (Figure 2)



  • Input shape: (batch_size, 10*16) or (batch_size, 16)
  • Output shape: (batch_size, 28*28)

  • Encourage the digit capsules to encode the instantiation parameters of the input digit.

  • During training, all capsules except the one corresponding to the ground-truth digit are masked out.
  • At test time, all capsules except the one with the largest length are masked out.
  • The objective is to minimize the sum of squared differences between the reconstruction and the input image; the reconstruction loss is scaled by 0.0005 so that it does not dominate the margin loss (a decoder sketch follows at the end of this subsection).

  • To summarize, by using the reconstruction loss as a regularizer, the Capsule Network is able to learn a global linear manifold between a whole object and the pose of the object as a matrix of weights via unsupervised learning.

  • As such, translation invariance is encapsulated in the matrix of weights rather than in the neural activities, which makes the network translation equivariant.
  • Therefore, we are in some sense performing a 'mental rotation and translation' of the image when it is multiplied by this global linear manifold.
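A hedged PyTorch sketch of the masking and the decoder. The 512-1024-784 fully connected layers with ReLU and a final sigmoid follow Figure 2; the masking code and names are my reading of the procedure described above:

```python
import torch
import torch.nn as nn

decoder = nn.Sequential(
    nn.Linear(10 * 16, 512), nn.ReLU(),
    nn.Linear(512, 1024), nn.ReLU(),
    nn.Linear(1024, 28 * 28), nn.Sigmoid(),
)

def reconstruct(v, labels=None):
    """v: (batch, 10, 16) digit capsules. During training, pass the ground-truth
    labels; at test time (labels=None), the longest capsule is kept."""
    if labels is None:
        labels = v.norm(dim=-1).argmax(dim=1)
    mask = torch.zeros(v.size(0), 10, device=v.device)
    mask.scatter_(1, labels.unsqueeze(1), 1.0)
    masked = v * mask.unsqueeze(-1)                        # zero out all but one capsule
    return decoder(masked.view(v.size(0), -1))             # (batch, 784)

v = torch.randn(2, 10, 16)
target = torch.rand(2, 28 * 28)
recon = reconstruct(v)                                     # test-time masking
recon_loss = 0.0005 * ((recon - target) ** 2).sum(dim=1).mean()
```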



6. Capsules on MNIST


6.1. Experimental Results



  • Both routing and the reconstruction regularizer improve performance.
  • CNN: 24.56M parameters.
  • CapsNet: 11.36M parameters.

6.2. What Individual Capsule Dimensions Represent (Figure 4)

The variations include stroke thickness, skew, and width. Among the 16 dimensions there is almost always one that represents the width of the digit; some dimensions represent combinations of global variations, while others represent a localized part of the digit (a perturbation sketch follows).
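A hedged sketch of the perturbation experiment behind Figure 4, reusing the `decoder` from the sketch in 5.4. The perturbation range of [-0.25, 0.25] in steps of 0.05 is the one reported in the paper; the function itself is mine:

```python
import torch

def perturb_dimension(v, k, dim, decoder):
    """v: (10, 16) masked digit capsules of one image (only capsule k is non-zero).
    Returns the reconstructions obtained by tweaking one dimension of capsule k."""
    images = []
    for delta in torch.arange(-0.25, 0.251, 0.05):     # 11 steps, as in the paper
        v_pert = v.clone()
        v_pert[k, dim] += delta
        images.append(decoder(v_pert.view(1, -1)))      # (1, 784) reconstruction
    return torch.cat(images)                            # (11, 784)
```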



6.3. Robustness to Affine Transformations

Because of the natural variance in skew, rotation, style, etc. in handwritten digits, a trained CapsNet is moderately robust to small affine transformations of the data.

  • Training set: MNIST
    Each digit is placed randomly on a black background (translation only).
  • Test set: affNIST
    Each MNIST digit undergoes a random small affine transformation (a rough sketch of such transforms follows below).

CapsNet training was stopped when it reached 99.23% accuracy on the training set; it then achieved 79% accuracy on the affNIST test set.
The baseline CNN, stopped at 99.22% training accuracy, achieved only 66% on affNIST.
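affNIST is an existing benchmark; the sketch below only illustrates the idea of testing on small random affine transforms using torchvision, and the parameter ranges are illustrative assumptions, not the ones used to build affNIST:

```python
import torch
from torchvision import transforms

pad_to_40 = transforms.Pad(6)                      # 28 + 2*6 = 40 (affNIST uses a 40x40 canvas)
small_affine = transforms.RandomAffine(
    degrees=15, translate=(0.1, 0.1), scale=(0.85, 1.15), shear=10)

digit = torch.rand(1, 28, 28)                      # stand-in for an MNIST digit
transformed = small_affine(pad_to_40(digit))       # (1, 40, 40) affinely transformed test image
```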

7. Segmenting Highly Overlapping Digits


7.1. Routing as an Attention Mechanism

  • Dynamic routing can be viewed as a parallel attention mechanism: each capsule at level L+1 attends to some active capsules at level L and ignores others. This makes it possible to recognize multiple objects even when they overlap.
  • The routing by agreement should make it possible to use a prior about shape of objects to help segmentation.
  • It should obviate the need to make higher level segmentation decisions in the domain of pixels.

7.2. MultiMNIST dataset

  • Each of the two digit images is shifted by up to 4 pixels in each direction, and the two are overlaid to form a 36x36 image (a generation sketch follows).
  • The MultiMNIST results show that each digit capsule can pick up the style and position from the votes it receives from the PrimaryCapsules layer.
  • The two capsules with the largest lengths are fed to the decoder one at a time, yielding two reconstructed digit images.
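A hedged numpy sketch of generating one MultiMNIST example. The up-to-4-pixel shifts and the 36x36 result follow the note above; overlaying by pixel-wise maximum and the remaining details are my assumptions:

```python
import numpy as np

def shift_into_36(digit, rng):
    """Place a 28x28 digit on a 36x36 canvas, shifted by up to 4 px in each direction."""
    canvas = np.zeros((36, 36), dtype=digit.dtype)
    dx, dy = rng.integers(0, 9, size=2)            # offsets 0..8, i.e. 4 +/- 4 pixels
    canvas[dy:dy + 28, dx:dx + 28] = digit
    return canvas

rng = np.random.default_rng(0)
digit_a = rng.random((28, 28))                     # stand-ins for two MNIST digits
digit_b = rng.random((28, 28))                     # of different classes
multi = np.maximum(shift_into_36(digit_a, rng), shift_into_36(digit_b, rng))   # 36x36 overlay
```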


8. Other Datasets


8.1. CIFAR10 (3 channels)

  • test_acc: 89.4%.
  • Drawback: the backgrounds of CIFAR10 images vary too much to be modeled by a reasonably sized network, which limits performance.

8.2. smallNORB

  • test_acc: 97.3%.

8.3. SVHN

  • test_acc: 95.7%.

9. Discussion and previous work


A capsule converts pixel intensities into vectors of instantiation parameters of recognized fragments, then applies transformation matrices to these fragments to predict the instantiation parameters of larger fragments.

9.1. Shortcomings of CNNs

  • Convolutional nets have difficulty in generalizing to novel viewpoints.
  • CNNs are invariant to translation, but not to other affine transformations (rotation, scaling, shear). Handling these by brute force requires either replicating feature detectors on a grid that grows exponentially with the number of transformation dimensions, or an exponentially larger labeled training set.

9.2. Capsule

  • Capsules make a very strong assumption: at each location in the image, a capsule represents at most one instance of the type of entity it encodes.
  • The neural activities of a capsule change as the viewpoint changes (equivariance), rather than trying to eliminate the effect of viewpoint (as, e.g., spatial transformer networks do). This makes it possible to deal simultaneously with different affine transformations of different objects or object parts.

9.3. Number of Routing Iterations